bounceR:
Automated Feature Selection for Machine Learning Algorithms
We are a consulting company for data science, machine learning and statistics with offices in Frankfurt, Zurich and Stuttgart. We support our customers in the development and implementation of data science and machine learning solutions.
Data science projects often follow a similar structure: at the very beginning, one must load and prepare the data. Everything afterwards is fun; those first two steps are not.
Currently there are two main ways to select the relevant features out of the entire feature space:
Componentwise Gradient Boosting is a boosting ensemble algorithm that allows one to assess the relevance of individual features: in each iteration, only the single best-fitting feature is updated, so features that are never picked end up with a zero coefficient.
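The componentwise idea can be sketched in a few lines of R (a minimal illustrative implementation for a linear model, not the package's actual code):

```r
# Minimal componentwise L2-boosting sketch: each iteration fits every feature
# separately to the current residuals and updates only the best-fitting one.
componentwise_boost <- function(X, y, mstop = 100, nu = 0.1) {
  beta <- rep(0, ncol(X))
  f    <- rep(0, nrow(X))
  for (m in seq_len(mstop)) {
    u <- y - f                                    # negative gradient (residuals)
    rss <- sapply(seq_len(ncol(X)), function(j) {
      b <- coef(lm(u ~ X[, j] - 1))
      sum((u - X[, j] * b)^2)                     # residual sum of squares
    })
    j_star <- which.min(rss)                      # best-fitting feature this round
    b_star <- coef(lm(u ~ X[, j_star] - 1))
    beta[j_star] <- beta[j_star] + nu * b_star    # small, shrunken update
    f <- f + nu * X[, j_star] * b_star
  }
  beta  # features with beta != 0 were selected at least once
}
```

Because only one feature is updated per iteration and the updates are shrunken by `nu`, stopping early leaves irrelevant features untouched, which is what makes the method useful for feature selection.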
The goal is a feature selection algorithm that can distinguish relevant from irrelevant features without overfitting the training data.
In each round, a random stability score distribution is initialized and then adjusted over the course of \( m \) models.
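The score update can be sketched roughly as follows (an illustrative interpretation of the `reward` and `penalty` parameters from the interface below, not the package's actual implementation):

```r
# Hypothetical stability-score update: features that a model keeps after early
# stopping are rewarded, features that were sampled but dropped are penalized.
update_scores <- function(scores, sampled, selected,
                          reward = 0.2, penalty = 0.3) {
  scores[selected] <- scores[selected] + reward   # survived the model
  dropped <- setdiff(sampled, selected)
  scores[dropped] <- scores[dropped] - penalty    # sampled but dropped
  scores
}
```

Repeating this over many randomly drawn feature subsets lets the score distribution separate stable, relevant features from noise.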
Essentially, we take bits from cool algorithms and put them together. For one, we leverage the complete randomness of random forests. Additionally, we apply a somewhat transformed idea of backpropagation.
Sure, there are a lot of tuning parameters, but we put them all together in a nice and handy little interface. By the way, we set the defaults based on several simulation studies, so you can, more or less, trust them.
# Feature Selection using bounceR ----------------------------------------
selection <- featureSelection(data = train_df,
                              target = "target",
                              index = NULL,
                              selection = selectionControl(n_rounds = 100,
                                                           n_mods = 1000,
                                                           p = NULL,
                                                           reward = 0.2,
                                                           penalty = 0.3,
                                                           max_features = NULL),
                              bootstrap = "regular",
                              boosting = boostingControl(mstop = 100, nu = 0.1),
                              early_stopping = "aic",
                              n_cores = 6)
The package is still under development and not yet listed on CRAN. However, you can get it from GitHub.
# load devtools
install.packages("devtools")
library(devtools)
# download from our public repo
devtools::install_github("STATWORX/bounceR")
# source it
library(bounceR)
If you find any bugs or spot anything that is not super convenient, just open an issue.
The package contains a variety of useful functions surrounding the topic of feature selection, such as:
sim_data: a function simulating regression and classification data, where the true feature space is knownfeatureFiltering: a function implementing several popular filter methods for feature selectionfeatureSelection: a function implementing our home grown algorithm for feature selectionprint.sel_obj: an S4 priniting method for the object class “sel_obj”plot.sel_obj: an S4 ploting method for the object class “sel_obj”summary.sel_obj: an S4 summary method for the object class “sel_obj”builder: method to extract a formula with n features from a “sel_obj”If you have any questions, are interested or have an idea, just contact us!